
ENH: add validate argument to merge #16275


Merged
merged 2 commits into pandas-dev:master on May 22, 2017

Conversation

nickeubank
Contributor

Open question: this does not currently check whether the "many" side of a merge is in fact many. I can't think of a case where I'm doing a one-to-many merge and would be worried if it was in fact one-to-one, but I'm open to alternate opinions.

right_w_dups = right.append(pd.DataFrame({'a': ['e'], 'c': ['moo']}, index=[4]))
merge(left, right_w_dups, left_index=True, right_index=True, validate='one_to_many')
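For illustration, a minimal self-contained sketch of that open question (the frames and key column here are made up, not from the PR): with the behaviour as implemented, validate='one_to_many' only checks the left keys, so it passes even when the merge happens to be one-to-one.

import pandas as pd

left = pd.DataFrame({'key': [1, 2, 3], 'a': ['x', 'y', 'z']})
right = pd.DataFrame({'key': [1, 2, 3], 'b': ['p', 'q', 'r']})  # right keys also unique

# Passes: 'one_to_many' only requires the left merge keys to be unique,
# so no error is raised even though this is in fact a one-to-one merge.
pd.merge(left, right, on='key', validate='one_to_many')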

def f():
Contributor

use the with syntax

with pytest.raises(..):
...
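For example, the wrapper-function pattern above could be rewritten roughly as follows (the frames and the expected error type are assumed for illustration):

import pytest
from pandas import DataFrame, merge

def test_validate_one_to_one_raises():
    left = DataFrame({'a': ['a', 'b'], 'b': [0, 1]})
    right = DataFrame({'a': ['a', 'a'], 'c': [0, 1]})  # duplicate key on the right

    # context-manager form instead of defining f() and passing it to pytest.raises
    with pytest.raises(ValueError):
        merge(left, right, on='a', validate='one_to_one')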

.. _whatsnew_0210.enhancements.merge_validate:

``merge`` supports ``validate`` argument to test merge
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Contributor

this is a bit verbose



if self.validate == "one_to_one" or self.validate == "1:1":
if not left_unique and not right_unique:
Contributor

use in [..., ...]
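i.e. roughly (a sketch of the suggestion, reusing the left_unique / right_unique flags from the PR):

if self.validate in ["one_to_one", "1:1"]:
    if not left_unique and not right_unique:
        raise ValueError("Merge keys are not unique in either left "
                         "or right dataset; not a one-to-one merge")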

@jreback jreback added Enhancement Reshaping Concat, Merge/Join, Stack/Unstack, Explode labels May 7, 2017
@nickeubank nickeubank force-pushed the check_merge branch 3 times, most recently from 8874c7d to e5a6ee1 on May 7, 2017 18:58
@codecov

codecov bot commented May 7, 2017

Codecov Report

Merging #16275 into master will decrease coverage by 0.01%.
The diff coverage is 92.59%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #16275      +/-   ##
==========================================
- Coverage   90.34%   90.33%   -0.02%     
==========================================
  Files         167      167              
  Lines       50908    50935      +27     
==========================================
+ Hits        45994    46011      +17     
- Misses       4914     4924      +10
Flag Coverage Δ
#multiple 88.12% <92.59%> (ø) ⬆️
#single 40.27% <3.7%> (-0.13%) ⬇️
Impacted Files Coverage Δ
pandas/core/frame.py 97.59% <ø> (-0.1%) ⬇️
pandas/core/reshape/merge.py 93.89% <92.59%> (-0.06%) ⬇️
pandas/io/gbq.py 25% <0%> (-58.34%) ⬇️
pandas/core/common.py 91.03% <0%> (+0.34%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update ea56550...e5a6ee1. Read the comment docs.

@codecov

codecov bot commented May 7, 2017

Codecov Report

Merging #16275 into master will increase coverage by 0.03%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #16275      +/-   ##
==========================================
+ Coverage   90.38%   90.41%   +0.03%     
==========================================
  Files         161      161              
  Lines       50999    51024      +25     
==========================================
+ Hits        46096    46134      +38     
+ Misses       4903     4890      -13
Flag Coverage Δ
#multiple 88.25% <100%> (+0.03%) ⬆️
#single 40.17% <4%> (-0.02%) ⬇️
Impacted Files Coverage Δ
pandas/core/frame.py 97.69% <ø> (ø) ⬆️
pandas/core/reshape/merge.py 94.19% <100%> (+0.24%) ⬆️
pandas/core/indexes/datetimes.py 95.33% <0%> (+0.09%) ⬆️
pandas/plotting/_converter.py 65.05% <0%> (+1.81%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update a6fcec6...3488f00. Read the comment docs.

@nickeubank
Contributor Author

@jreback I'm not sure the code in the PR checklist (git diff upstream/master --name-only -- '*.py' | flake8 --diff) works right. I don't get any errors with that, but clearly have flake8 errors (as evident from Travis tests).

@jreback
Contributor

jreback commented May 7, 2017

@nickeubank
Contributor Author

@jreback thanks -- but on my system I got no errors even when I had trailing whitespace and excessive empty lines. I think the command should be git diff master -- '*.py' | flake8 --diff, no?

@jreback
Contributor

jreback commented May 7, 2017

If you read the doc, it explains what upstream/master is.

@nickeubank
Contributor Author

I'm fine with upstream/master (I was copying off the doc, which just has master). I think the problem is the --name-only option. So it should be git diff upstream/master -- '*.py' | flake8 --diff.

@jreback
Contributor

jreback commented May 7, 2017

This command will catch any stylistic errors in your changes specifically, but beware it may not catch all of them. For example, if you delete the only usage of an imported function, it is stylistically incorrect to import an unused function. However, style-checking the diff will not catch this because the actual import is not part of the diff. Thus, for completeness, you should run this command, though it will take longer:

git diff master --name-only -- '*.py' | grep 'pandas/' | xargs -r flake8

@nickeubank
Contributor Author

It's not even catching local stylistic issues. It may be an argument-ordering problem.

[screenshot of terminal output, 2017-05-07 1:32 pm]

@jreback
Contributor

jreback commented May 7, 2017

I have this as an alias and it works like a charm: git diff master --name-only -- '*.py' | grep 'pandas/' | xargs -r flake8

yeah arg ordering does sometimes matter.

@nickeubank
Contributor Author

So maybe we need to change the checkbox command. Should I open an issue?

@jreback
Contributor

jreback commented May 7, 2017

you can just do a PR (also make sure the contributing docs are consistent).

@@ -174,6 +174,17 @@

.. versionadded:: 0.17.0

validate: None or string, default None
If specified, checks to ensure merge is of specified type.
Contributor

I think you need - or * on these? @jorisvandenbossche ?

validate: None or string, default None
If specified, checks to ensure merge is of specified type.
If "one_to_one" or "1:1", checks merge keys are unique in both
left and right dataset.
Contributor

checks if merge keys (and same on all)

@@ -952,6 +959,51 @@ def _validate_specification(self):
if len(self.right_on) != len(self.left_on):
raise ValueError("len(right_on) must equal len(left_on)")

def _validate(self):
# Get merging series:
Contributor

simply use self.left_join_keys and self.right_join_keys which are already lists of arrays (1 or more). No need to repeat this logic.

Contributor Author

I tried that, but if I want to use duplicated(), it ends up taking just as much additional code (or more) to work with self.right_join_keys or self.left_join_keys -- I have to coerce them into DataFrames in a way that's amenable to their variable potential dimensions, then run duplicated(). I don't think that's actually any better.

Contributor

show what you are doing

Contributor Author

The alternative? The shortest safe version I have is:

pd.concat(list(map(pd.Series, left_keys)), axis='columns')

@@ -561,6 +565,9 @@ def __init__(self, left, right, how='inner', on=None,
# to avoid incompat dtypes
self._maybe_coerce_merge_keys()

if self.validate is not None:
Contributor

add a comment here on what is happening, pass in validate and don't make it an instance variable. This will just raise if there is an error.
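A sketch of what that might look like in __init__ (the exact placement is assumed):

# If a validate specification was passed, check the uniqueness of the
# merge keys now; this simply raises if the check fails and has no
# effect on the result otherwise.
if validate is not None:
    self._validate(validate)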

@@ -724,6 +724,63 @@ def test_indicator(self):
how='outer', indicator=True)
assert_frame_equal(test5, hand_coded_result)

def test_validation(self):
left = DataFrame({'a': ['a', 'b', 'c', 'd'],
Contributor

add tests for merge_asof; this applies to it as well (different file).

Contributor Author

added


The ``validate`` argument for :func:`merge` function now checks whether a merge is
one-to-one, one-to-many, many-to-one, or many-to-many. If a merge is found to not
be an example of specified merge type, an exception will be raised.
Contributor

issue number at the end. You can just add this to Other Enhancements, doesn't need a sub-section (unless you want to add a full-on example).

@jreback
Contributor

jreback commented May 7, 2017

can you add a small section in the merge docs (and link from the doc-string to it)? Also an example in the doc-string would be good. Finally, there is an existing warning in the docs somewhat related to this (checking for duplicates); can you revise it?

may result in memory overflow. It is the user's responsibility to manage duplicate values in keys before joining large DataFrames.
Joining / merging on duplicate keys can cause a returned frame that is the multiplication of the row dimensions, which may result in memory overflow. It is the user's responsibility to manage duplicate values in keys before joining large DataFrames.
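For context, a small made-up illustration of the row multiplication described in that warning:

import pandas as pd

left = pd.DataFrame({'key': ['a', 'a'], 'x': [1, 2]})
right = pd.DataFrame({'key': ['a', 'a'], 'y': [3, 4]})

# Both sides contain the key 'a' twice, so the result has 2 * 2 = 4 rows.
result = pd.merge(left, right, on='key')
print(len(result))  # 4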

Checking for duplicate keys
Contributor

versionadded tag, add a ref tag as well (e.g. _merging.validation::)


right = pd.DataFrame({'A' : [4,5,6], 'B': [2, 2, 2]})

result = pd.merge(left, right, on='B', how='outer', validate="one_to_one");
Contributor

this raises, so better to use a code-block and show what this is going to raise.

@@ -724,6 +724,63 @@ def test_indicator(self):
how='outer', indicator=True)
assert_frame_equal(test5, hand_coded_result)

def test_validation(self):
Contributor

rather than repeating tests, I think pull these out to another test file (test_api) and run them through both merge and merge_asof (parametrize is best)

Contributor Author

The only issue I see is that there are requirements for merge_asof that don't apply to merge (for example, merge_asof requires sorting). I can do this, but it will limit the set of tests I can run to those that work on both (current tests for merge don't work for merge_asof).

Contributor

maybe just do some simple validation then (and leave them where they are). having completely duplicated tests is not that useful.

Contributor Author

ok. Added some simple validations to merge_asof

if self.right_index:
right_unique = not (self.orig_right.index.duplicated()).any()
else:
right_unique = not (self.orig_right[right_key].duplicated()).any()
Contributor

something like

MultiIndex.from_arrays(self.left_join_keys).is_unique is clearer / faster.
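A standalone illustration of that check (the key arrays here are made up; in the PR, self.left_join_keys is already a list of arrays):

import numpy as np
from pandas import MultiIndex

left_join_keys = [np.array([1, 1, 2]), np.array(['a', 'b', 'b'])]

# Unique across the combination of all key columns, without building a DataFrame
left_unique = MultiIndex.from_arrays(left_join_keys).is_unique
print(left_unique)  # True: (1, 'a'), (1, 'b'), (2, 'b') are all distinct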

Contributor Author

Not convinced that's more readable than just having that original pandas code, but I will put it in.

Contributor

@nickeubank the point is you are repeating lots of code for no reason. The notion of figuring out which is what (from the input args) is actually non-trivial and will be refactored shortly. Don't repeat it.

Contributor

see #16228 for the issue

- ``validate`` : {None, '1:1', '1:m', 'm:1', 'm:m', "one_to_one", "one_to_many", "many_to_one", "many_to_many"}, default None
If specified, checks if merge is of specified type.
* "one_to_one" or "1:1": check if merge keys are unique in both
left and right dataset.
Contributor

datasets

@@ -551,6 +552,18 @@ standard database join operations between DataFrame objects:

.. versionadded:: 0.17.0

- ``validate`` : {None, '1:1', '1:m', 'm:1', 'm:m', "one_to_one", "one_to_many", "many_to_one", "many_to_many"}, default None
Contributor

I would split this up and put the choices on the 2nd line, otherwise this gets too long (or you can use \ to split the line). maybe check how this all renders.

In the following example, there are duplicate values of ``B`` in the right DataFrame. As this is not a one-to-one merge -- as specified in the ``validate`` argument -- an exception will be raised.

.. code-block:: python

Contributor

when using code-block, you need to copy-paste exactly from an ipython session (e.g. formatting counts).


If the user is aware of the duplicates in the right `DataFrame` but wants to ensure there are no duplicates in the left DataFrame, one can use the `one_to_many` argument instead, which will not raise an exception.

.. ipython:: python
Contributor

better to move this block before the other one and actually run it, then the previous code-block only needs to show the merge operation (as you will have already shown the df when you remove the suppress)

validate : {None, '1:1', '1:m', 'm:1', 'm:m', "one_to_one", "one_to_many",
"many_to_one", "many_to_many"}, default None
If specified, checks if merge is of specified type.
* "one_to_one" or "1:1": check if merge keys are unique in both
Contributor

I think you need to indent the 2nd row after the * @jorisvandenbossche ?

Member

indeed

Member

also needs a blank line above this one (to delineate the itemize block)

@@ -341,6 +344,19 @@ def merge_asof(left, right, on=None,

.. versionadded:: 0.20.0

validate : {None, '1:1', '1:m', 'm:1', 'm:m', "one_to_one", "one_to_many",
"many_to_one", "many_to_many"}, default None
Contributor

Same. As an aside, we ought to use a template for these merge* doc-strings.

def _validate(self, validate):

# Check uniqueness of each
if self.left_index:
Contributor

not self.orig_left.index.is_unique

left_unique = MultiIndex.from_arrays(self.left_join_keys
).is_unique

if self.right_index:
Contributor

same

'nay', 'chirp']},
index=range(5))

merge(left, right, left_index=True, right_index=True, validate='1:1')
Contributor

comment on why you are not checking the result values

Contributor Author

Good point. Come to think of it, I actually should test those -- in theory validate has no impact on the output (it only runs checks), but there's no reason not to make that a tested behavior.
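A sketch of the kind of result check being discussed (the frames are illustrative; the point is that validate only adds a check and should not change the output):

from pandas import DataFrame, merge
from pandas.testing import assert_frame_equal

left = DataFrame({'a': ['a', 'b'], 'b': [0, 1]})
right = DataFrame({'a': ['a', 'b'], 'c': [0, 1]})

# The validated merge should produce exactly the same frame as a plain merge.
expected = merge(left, right, on='a')
result = merge(left, right, on='a', validate='1:1')
assert_frame_equal(result, expected)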

Member

@jorisvandenbossche jorisvandenbossche left a comment

I am not sure it makes sense to also have this keyword for merge_asof, as you do not match on exact keys anyway, but on the nearest.
Further, this also gets complicated because e.g. allow_exact_matches can actually influence whether there would be duplicate keys or not.
Also, the explanation would need to be different, as you don't have the same concept of a "one_to_many" merge as with the normal merge: even if you have duplicate keys, you end up with a one_to_one kind of merge.

@@ -551,6 +552,18 @@ standard database join operations between DataFrame objects:

.. versionadded:: 0.17.0

- ``validate`` : {None, '1:1', '1:m', 'm:1', 'm:m', "one_to_one", "one_to_many", "many_to_one", "many_to_many"}, default None
If specified, checks if merge is of specified type.
* "one_to_one" or "1:1": check if merge keys are unique in both
Member

Needs a blank line above this one (to delineate the itemize)

- ``validate`` : {None, '1:1', '1:m', 'm:1', 'm:m', "one_to_one", "one_to_many", "many_to_one", "many_to_many"}, default None
If specified, checks if merge is of specified type.
* "one_to_one" or "1:1": check if merge keys are unique in both
left and right dataset.
Member

this has to be indented at the same level as "one...

Member

and the same for the two items below

validate : {None, '1:1', '1:m', 'm:1', 'm:m', "one_to_one", "one_to_many",
"many_to_one", "many_to_many"}, default None
If specified, checks if merge is of specified type.
* "one_to_one" or "1:1": check if merge keys are unique in both
Member

indeed

validate : {None, '1:1', '1:m', 'm:1', 'm:m', "one_to_one", "one_to_many",
"many_to_one", "many_to_many"}, default None
If specified, checks if merge is of specified type.
* "one_to_one" or "1:1": check if merge keys are unique in both
Member

also needs a blank line above this one (to delineate the itemize block)

@@ -174,6 +174,18 @@

.. versionadded:: 0.17.0

validate : {None, '1:1', '1:m', 'm:1', 'm:m', "one_to_one", "one_to_many",
"many_to_one", "many_to_many"}, default None
Member

many_to_many is listed as a possibility, but is not explained in the list below

Member

The formatting here (multiple lines for the type specification; to render nicely in the html docs it has to fit on one line) currently does not work nicely with numpydoc (see numpy/numpydoc#87).

Putting just "string, default None" and then listing the options below is an alternative.
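A sketch of that alternative layout for the docstring entry (wording taken from the option list in the PR):

validate : string, default None
    If specified, checks if merge is of specified type.

    * "one_to_one" or "1:1": check if merge keys are unique in both
      left and right datasets.
    * "one_to_many" or "1:m": check if merge keys are unique in left dataset.
    * "many_to_one" or "m:1": check if merge keys are unique in right dataset.
    * "many_to_many" or "m:m": allowed, but does not result in checks.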

Contributor Author

@jorisvandenbossche This relates to the question at the top of this PR: should we check the manys? In other words, if someone passed "one_to_many" but the merge is actually "one_to_one", is that ok?

My sense is that's fine -- I've never been in a situation where getting 1:1 where I expected 1:m was a problem -- but one corollary is that "many_to_many" doesn't actually do anything. I like the idea of leaving it there -- it's nice to be able to label one's merges consistently in this way -- but that specification doesn't imply any checks (which is why I left it off the list).

Member

Yes, I agree with the idea of not checking that a 'many' is in practice a 'one'. So this means that 'many_to_many' should check nothing.

But apart from that, we should either not have it as an option, or else include it in the docstring (and just say that it does nothing).

Member

I see you already did that in the meantime!

).is_unique

# Check valid arg
if validate not in ['one_to_one', '1:1',
Member

can also put this in the end as an else clause, then don't need to list all options here again

raise ValueError("Merge keys are not unique in right dataset;"
" not a one-to-one merge")

if validate in ["one_to_many", "1:m"]:
Member

if -> elif (the same below) (makes it clearer code-wise they are exclusive if statements)
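Putting the two suggestions together (an elif chain plus a trailing else for unrecognised values), the validation logic might look roughly like this; the function below takes the uniqueness flags as arguments purely to keep the sketch self-contained, whereas the PR computes them inside _validate:

def _validate(validate, left_unique, right_unique):
    if validate in ["one_to_one", "1:1"]:
        if not left_unique and not right_unique:
            raise ValueError("Merge keys are not unique in either left "
                             "or right dataset; not a one-to-one merge")
        elif not left_unique:
            raise ValueError("Merge keys are not unique in left dataset; "
                             "not a one-to-one merge")
        elif not right_unique:
            raise ValueError("Merge keys are not unique in right dataset; "
                             "not a one-to-one merge")
    elif validate in ["one_to_many", "1:m"]:
        if not left_unique:
            raise ValueError("Merge keys are not unique in left dataset; "
                             "not a one-to-many merge")
    elif validate in ["many_to_one", "m:1"]:
        if not right_unique:
            raise ValueError("Merge keys are not unique in right dataset; "
                             "not a many-to-one merge")
    elif validate in ["many_to_many", "m:m"]:
        pass  # allowed, but no checks are performed
    else:
        raise ValueError("Not a valid argument for validate")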

# Check invalid arguments
with pytest.raises(ValueError):
merge(left, right, on='a', validate='jibberish')

Member

can you also add tests for merging on multiple columns (and maybe also one test combining e.g. left_index and right_on, but no need to do that for all of them)
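A sketch of such a multi-column test (data and names are illustrative):

import pytest
from pandas import DataFrame, concat, merge

def test_validate_multiple_columns():
    left = DataFrame({'a': ['a', 'a', 'b'], 'b': [0, 1, 0], 'x': [1, 2, 3]})
    right = DataFrame({'a': ['a', 'a', 'b'], 'b': [0, 1, 0], 'y': [4, 5, 6]})

    # the (a, b) key pairs are unique on both sides, so 1:1 validation passes
    merge(left, right, on=['a', 'b'], validate='1:1')

    # duplicating one key pair on the right should make 1:1 validation fail
    right_dup = concat([right, right.iloc[[0]]], ignore_index=True)
    with pytest.raises(ValueError):
        merge(left, right_dup, on=['a', 'b'], validate='1:1')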

Contributor Author

@nickeubank nickeubank left a comment

I'm fine with dropping from merge_asof. @jreback you ok with that?

@jreback
Contributor

jreback commented May 11, 2017

yep

@nickeubank nickeubank force-pushed the check_merge branch 2 times, most recently from 9c24834 to d7de5fa on May 11, 2017 05:33
dataset.
* "many_to_one" or "m:1": checks if merge keys are unique in right
dataset.
* "many_to_may" or "m:m": allowed, but does not result in checks.
Member

may -> many

988 elif not right_unique:
--> 989 raise ValueError("Merge keys are not unique in right dataset;"
990 " not a one-to-one merge")
991
Member

You can truncate this to only the final 'ValueError: .... ' line

``validate`` argument checks merge key uniqueness
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The ``validate`` argument for :func:`merge` function now checks whether a merge is
Member

for the merge function

and maybe "the optional validate argument" (to make clear that is not the default to check something)

@@ -174,6 +174,19 @@

.. versionadded:: 0.17.0

validate : String, default None
Member

(minor nitpick) String -> string

@nickeubank nickeubank force-pushed the check_merge branch 2 times, most recently from 8b8f3b0 to 0c2ef6f on May 19, 2017 19:57
@nickeubank
Contributor Author

@jorisvandenbossche @jreback Think I covered all suggestions!

Contributor

@jreback jreback left a comment

some doc/style comments

@@ -551,6 +552,21 @@ standard database join operations between DataFrame objects:

.. versionadded:: 0.17.0

- ``validate`` : String, default None
If specified, checks if merge is of specified type.
Contributor

does this render? I think it needs indenting

Contributor Author

[screenshot of the rendered documentation, 2017-05-21 9:09 am]

left = pd.DataFrame({'A' : [1,2], 'B' : [1, 2]})
right = pd.DataFrame({'A' : [4,5,6], 'B': [2, 2, 2]})

.. code-block:: python
Contributor

this needs to look like ipython output (the code block)

Contributor Author

Tweaked. Joris said to suppress most of the traceback, but I added the In [] and Out []. If you need anything else, please specify.
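For reference, the ipython-style formatting being asked for looks roughly like this (prompt numbering and the amount of traceback shown are illustrative; the merge call and error message come from the example above):

In [1]: result = pd.merge(left, right, on='B', how='outer', validate="one_to_one")
...
ValueError: Merge keys are not unique in right dataset; not a one-to-one merge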


ValueError: Merge keys are not unique in right dataset; not a one-to-one merge

If the user is aware of the duplicates in the right `DataFrame` but wants to ensure there are no duplicates in the left DataFrame, one can use the `one_to_many` argument instead, which will not raise an exception.
Contributor

use validate='one_to_many'


- The ``validate`` argument for :func:`merge` function now checks whether a merge is
one-to-one, one-to-many, many-to-one, or many-to-many. If a merge is found to not
be an example of specified merge type, an exception will be raised. (:issue:`16270`)
Contributor

add a link to the doc-section

# Check data integrity
if validate in ["one_to_one", "1:1"]:
if not left_unique and not right_unique:
raise ValueError("Merge keys are not unique in either left"
Contributor

we might want to raise a MergeError (inherits from ValueError) instead here (as this is what we do for the other operations). Have a look and see if this is inconsistent

Contributor Author

Converted. I think that's fair.
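A sketch of the suggested swap; MergeError is defined in pandas/core/reshape/merge.py as a subclass of ValueError (exposing it in pandas.errors is discussed further down), and the helper function here is purely illustrative:

class MergeError(ValueError):
    """Raised when a merge cannot be performed as requested."""

def check_one_to_one(left_unique, right_unique):
    # illustrative helper: raise MergeError rather than a bare ValueError
    if not left_unique and not right_unique:
        raise MergeError("Merge keys are not unique in either left "
                         "or right dataset; not a one-to-one merge")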

@jorisvandenbossche jorisvandenbossche changed the title add validate argument to merge ENH: add validate argument to merge May 22, 2017
@jorisvandenbossche jorisvandenbossche merged commit b297983 into pandas-dev:master May 22, 2017
@jorisvandenbossche
Member

@nickeubank Thanks!

@jorisvandenbossche
Member

@nickeubank I just noticed the MergeError comment, which I think you actually did not change? (only in the doc)

So maybe we can do a follow-up PR if we want to change this from a ValueError to a MergeError. In any case, if we use this error, we also have to move it to pandas.errors module for public exposure. And mention it in the docstring that this error is raised.

@jorisvandenbossche jorisvandenbossche added this to the 0.21.0 milestone May 22, 2017
@jreback
Contributor

jreback commented May 22, 2017

I agree. let's change ValueError -> MergeError (and then add MergeError to pandas.errors); also may need a doc-string :>

pcluo pushed a commit to pcluo/pandas that referenced this pull request May 22, 2017
@nickeubank
Contributor Author

@jreback @jorisvandenbossche ugh, sorry -- I updated it but apparently didn't push the commit. Will open new PR.

Labels
Enhancement, Reshaping (Concat, Merge/Join, Stack/Unstack, Explode)

Successfully merging this pull request may close these issues:
ENH: option to check merge is one-to-one, many-to-one, one-to-many, or many-to-many